PARTIE: a partition engine to separate metagenomic and amplicon projects in the Sequence Read Archive
Abstract
Motivation: The Sequence Read Archive (SRA) contains raw data from many different types of sequence projects. As of 2017, the SRA contained approximately ten petabases of DNA sequence (10^16 bp). Annotations of the data are provided by the submitter, and mining the data in the SRA is complicated by both the amount of data and the detail within those annotations. Here, we introduce PARTIE, a partition engine optimized to differentiate sequence read data into metagenomic (random) and amplicon (targeted) sequence data sets.
Results: PARTIE subsamples reads from the sequencing file and calculates four different statistics: k-mer frequency, 16S abundance, and prokaryotic- and viral-read abundance. These metrics are used to create a RandomForest decision tree to classify the sequencing data, and PARTIE provides mechanisms for both supervised and unsupervised classification. We demonstrate the accuracy of PARTIE for classifying SRA data, discuss the probable error rates in the SRA annotations and introduce a resource assessing SRA data.
Availability and Implementation: PARTIE and reclassified metagenome SRA entries are available from https://github.com/linsalrob/partie.
Contact: [email protected].
Supplementary information: Supplementary data are available at Bioinformatics online.
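As a rough illustration of why a k-mer statistic computed on a subsample can separate targeted from random data, the sketch below draws a subsample of reads and reports the fraction of distinct k-mers seen exactly once. It is not PARTIE's implementation: the function names, the value k = 12, the subsample size, the toy reads and the singleton-fraction summary are all assumptions chosen for the example.

```python
import random
from collections import Counter


def subsample_reads(reads, n, seed=0):
    """Return up to n reads drawn without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    if len(reads) <= n:
        return list(reads)
    return rng.sample(reads, n)


def kmer_counts(reads, k):
    """Count every overlapping k-mer across the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def singleton_kmer_fraction(reads, k):
    """Fraction of distinct k-mers that occur exactly once (hypothetical summary)."""
    counts = kmer_counts(reads, k)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy data (assumed): amplicon-like reads repeat one region,
# shotgun-like reads are random sequences.
rng = random.Random(1)
amplicon_like = ["ACGTTGCAAGGCTTAACCGGATCGATTGCA"] * 50
shotgun_like = ["".join(rng.choice("ACGT") for _ in range(30)) for _ in range(50)]

amplicon_score = singleton_kmer_fraction(subsample_reads(amplicon_like, 20), k=12)
shotgun_score = singleton_kmer_fraction(subsample_reads(shotgun_like, 20), k=12)
```

In this toy setting the amplicon-like subsample has no singleton k-mers, because every read covers the same region, while nearly all k-mers in the shotgun-like subsample are singletons. A value of this kind could serve as one input feature to a classifier alongside the 16S, prokaryotic and viral read abundances.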
Similar resources
Metagenomic exploration of the bacterial community structure at Paradip Port, Odisha, India
This is a pioneering report on the metagenomic exploration of the bacterial diversity from a busy sea port in Paradip, Odisha, India. In our study, high-throughput sequencing of community 16S rRNA gene amplicons was performed using the 454 GS Junior platform. The metagenome contains 34,121 sequences with 16,677,333 bp and 56.3% G + C content. Metagenome sequence data are now available at NCBI under the ...
Metagenomic data of the bacterial community in coastal Gulf of Mexico sediment microcosms following exposure to Macondo oil (MC252)
The data in this article includes the sequences of bacterial 16S rRNA gene from metagenome of Macondo oil (MC252)-treated and non-oil-treated sediment microcosms, collected from coastal Gulf of Mexico and Bayou La Batre, USA. Metacommunity DNA was PCR amplified with 341F and 907R oligonucleotide primers, targeting V3-V5 regions of the 16S rRNA gene. Data were generated by using bacterial tag-en...
Grinder: a versatile amplicon and shotgun sequence simulator
We introduce Grinder (http://sourceforge.net/projects/biogrinder/), an open-source bioinformatic tool to simulate amplicon and shotgun (genomic, metagenomic, transcriptomic and metatranscriptomic) datasets from reference sequences. This is the first tool to simulate amplicon datasets (e.g. 16S rRNA) widely used by microbial ecologists. Grinder can create sequence libraries with a specific commu...
Metagenomic analysis of fungal taxa inhabiting Mecca region, Saudi Arabia
The data presented contains the sequences of fungal Internal Transcribed Spacer (ITS) and 18S rRNA gene from a metagenome of the Mecca region, Saudi Arabia. Sequences were amplified using fungal specific primers, which amplified the amplicon aligned between the 18S and 28S rRNA genes. A total of 460 fungal species belonging to 133 genera, 58 families, 33 orders, 13 classes and 4 phyla were iden...
Functional assignment of metagenomic data: challenges and applications
Metagenomic sequencing provides a unique opportunity to explore earth's limitless environments harboring scores of yet unknown and mostly unculturable microbes and other organisms. Functional analysis of the metagenomic data plays a central role in projects aiming to explore the most essential questions in microbiology, namely 'In a given environment, among the microbes present, what are they d...